A method has been developed to calculate a single index for a build and platform from the Raptor test results for the build's run_suite and their corresponding metrics. The method tolerates missing run_suite measurements.
The method relies on a reference distribution:
1. Take the geomean of loadtime and fcp of each sample (total of 25) for each run_suite.
2. Calculate the eCDF of these values per run_suite for all reference builds.

Once the CDFs have been calculated, scores can be calculated for every build:
For each run_suite sample of a build, calculate its \(p_{run\_suite}\) from the relevant eCDF. The resulting values are combined into the build's final score, a value from 0 to 1, with lower being better (e.g., decreased loadtime and/or fcp).
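The scoring procedure can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code itself: the function names (`sample_geomeans`, `build_ecdf`, `build_score`), the suite names, and the timings are hypothetical. How the per-run_suite values are combined is assumed here to be a plain mean over the run_suite that are present, which is one simple way to tolerate missing measurements.

```python
from bisect import bisect_right
from statistics import geometric_mean, mean


def sample_geomeans(loadtime, fcp):
    """Geometric mean of loadtime and fcp for each paired sample."""
    return [geometric_mean([lt, f]) for lt, f in zip(loadtime, fcp)]


def build_ecdf(reference_values):
    """Return a function mapping a value to the share of reference values <= it."""
    ordered = sorted(reference_values)
    n = len(ordered)

    def ecdf(value):
        return bisect_right(ordered, value) / n

    return ecdf


def build_score(build_samples, ecdfs):
    """Score one build from its per-run_suite sample values.

    Assumption: each run_suite contributes the mean eCDF probability of its
    samples, and the build score is the mean over the run_suite that are
    present. Missing run_suite are skipped rather than imputed.
    """
    per_suite = []
    for suite, values in build_samples.items():
        if suite not in ecdfs or not values:
            continue
        per_suite.append(mean(ecdfs[suite](v) for v in values))
    return mean(per_suite) if per_suite else None


# Hypothetical reference values (already reduced to one geomean per sample).
reference = {
    "amazon": [100.0, 110.0, 120.0, 130.0],
    "facebook": [200.0, 220.0, 240.0, 260.0],
}
ecdfs = {suite: build_ecdf(values) for suite, values in reference.items()}

# A faster build with a missing run_suite, and a slower, complete build.
fast_build = {"amazon": [95.0, 105.0]}
slow_build = {"amazon": [125.0, 135.0], "facebook": [250.0, 270.0]}

fast_score = build_score(fast_build, ecdfs)
slow_score = build_score(slow_build, ecdfs)
```

Because each value is a position within the reference distribution, the score stays between 0 and 1 regardless of the absolute timing scale of any individual run_suite.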
The following plot illustrates the result for the geomean of loadtime and fcp across builds since February 1st for the Windows 10-64 platform.
The following analysis focuses upon Raptor page load metrics from warm runs, on the Windows platform, for mozilla-central builds.
A brief description of Raptor page load testing follows:
- There are n run_suite page load tests for each platform (e.g., Windows 10-64).
- Each run_suite is a serialized snapshot of a web page (e.g., Amazon, Facebook) that is played back via Mitmproxy.
- Each run_suite test measures four metrics for each of the 25 samples: dcf, fcp, fnbpaint, and loadtime.

Therefore, the results for a build and platform are composed of n run_suite, 4 metrics, and 25 samples (\(\underset{n\times 4 \times 25}{\mathrm{X}}\)).
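To make the n × 4 × 25 layout concrete, the results for one build and platform could be held as a nested mapping from run_suite to metric to samples. The sketch below is only an illustration: the container layout, the `result_shape` helper, and the placeholder timings are assumptions, while the metric names and the sample count come from the description above.

```python
METRICS = ("dcf", "fcp", "fnbpaint", "loadtime")
SAMPLES_PER_METRIC = 25


def result_shape(results):
    """Validate a {run_suite: {metric: [samples]}} mapping and return its shape."""
    for suite, metrics in results.items():
        if set(metrics) != set(METRICS):
            raise ValueError(f"{suite}: expected metrics {METRICS}")
        for name, samples in metrics.items():
            if len(samples) != SAMPLES_PER_METRIC:
                raise ValueError(
                    f"{suite}/{name}: expected {SAMPLES_PER_METRIC} samples"
                )
    return (len(results), len(METRICS), SAMPLES_PER_METRIC)


# Placeholder timings for two hypothetical run_suite.
example = {
    suite: {metric: [float(i) for i in range(1, 26)] for metric in METRICS}
    for suite in ("amazon", "facebook")
}

shape = result_shape(example)
total_values = shape[0] * shape[1] * shape[2]
```

Validating the shape up front makes an incomplete run_suite fail loudly before any aggregation is attempted.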
A single score for a Firefox build should have two traits: (i) actionability and (ii) interpretability. Producing a single score from Raptor page load tests requires multiple levels of aggregation, ending with the aggregation of the run_suite into a single value.

One characteristic of Raptor testing is that there are many different cases of run_suite incompleteness, where one or more run_suite are not tested for a specific build.
These gaps make it difficult to aggregate the individual run_suite into a single value in a consistent manner. For this analysis, 13 of the most common run_suite were chosen to minimize these issues. Builds that were missing one or more of these run_suite were dropped from the analysis. Of 569 builds, 474 (83%) had the complete set of run_suite.
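The completeness filter amounts to a subset check per build. The following sketch assumes each build is represented by the set of run_suite it reported; the build identifiers, the suite names, and the `complete_builds` helper are hypothetical, and only the 474 and 569 counts are taken from the analysis.

```python
def complete_builds(builds, required_suites):
    """Keep only the builds that reported every required run_suite."""
    required = set(required_suites)
    return {
        build_id: suites
        for build_id, suites in builds.items()
        if required <= set(suites)
    }


# Hypothetical builds: b2 is missing one required run_suite.
required = ["amazon", "facebook", "google"]
builds = {
    "b1": {"amazon", "facebook", "google"},
    "b2": {"amazon", "google"},
    "b3": {"amazon", "facebook", "google", "youtube"},
}
kept = complete_builds(builds, required)

# Share of complete builds reported in the analysis.
percent_complete = round(100 * 474 / 569)
```

Dropping incomplete builds keeps every remaining score comparable, at the cost of discarding the 95 builds that lacked at least one of the 13 run_suite.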
The dropped and incomplete run_suite are apparent in the following figure. NOTE: The scales for each facet are independent, to illustrate how these timings change across builds and over time.